
Hierarchical Clustering

Data Preprocessing

# Import the dataset and keep only the annual income and spending score columns
dataset = read.csv('Mall_Customers.csv')
X = dataset[4:5]
head(X, 10)
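
As a quick sanity check before clustering, we can look at the structure and ranges of the two selected features. This is an optional sketch; it assumes the standard Mall_Customers.csv layout, where columns 4 and 5 hold the annual income (k$) and the spending score (1-100).

str(X)      # types of the two selected columns
summary(X)  # ranges of annual income and spending score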

Using the dendrogram to find the optimal number of clusters

# Ward's method ('ward.D') is used to minimise the variance within each cluster
dendrogram = hclust(dist(X, method = 'euclidean'), method = 'ward.D')
plot(dendrogram, main = 'Dendrogram', xlab = 'Customers', ylab = 'Euclidean Distance')
[Output: dendrogram of the customers]

From the dendrogram above, the longest vertical line that is not crossed by any horizontal line indicates the optimal cut, which gives 5 clusters.
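
This visual reading can also be checked numerically: the merge heights of the tree are stored in dendrogram$height, and a long uncrossed vertical line corresponds to a large gap between consecutive merge heights. The sketch below is optional, and the variable names (heights, gaps) are just illustrative.

heights = sort(dendrogram$height, decreasing = TRUE)
gaps = -diff(heights)  # gap below each merge, from the top of the tree downwards
head(gaps, 10)         # a large gap after the k-th merge suggests k + 1 clusters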


Fitting hierarchical clustering to the Mall dataset

# Build the tree with the same settings and cut it into 5 clusters
hc = hclust(dist(X, method = 'euclidean'), method = 'ward.D')
y_hc = cutree(hc, 5)
head(y_hc, 10)
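
To see how large each segment is, and to keep the assignments next to the raw data, we can tabulate the labels and append them as a new column. This is a small optional sketch; the column name cluster is just an illustrative choice.

table(y_hc)             # number of customers per cluster
dataset$cluster = y_hc  # attach the cluster labels to the original data
head(dataset, 10)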

Visualising the clusters

library(cluster)
clusplot(X, y_hc,
         lines = 0, shade = TRUE, color = TRUE,
         labels = 5, plotchar = FALSE, span = TRUE,
         main = 'Clusters of customers',
         xlab = 'Annual Income', ylab = 'Spending Score')
# for more options see help(clusplot.default)
[Output: cluster plot of the customers]

The target customers are those with a high annual income and a high spending score. In this plot, the data points in cluster 1 are those high-income, high-spending customers.
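
As an optional follow-up, the target segment can be pulled out directly from the labels. This sketch assumes, as read off the plot above, that cluster 1 is the high-income, high-spending group; cutree numbers clusters arbitrarily, so this should be confirmed against the plot before relying on it.

target_customers = dataset[y_hc == 1, ]  # assumed target segment (verify against the plot)
head(target_customers, 10)
summary(target_customers[4:5])           # income and spending score of the target group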